{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M1.03\n",
    "\n",
    "The goal of this exercise is to compare the performance of our classifier in\n",
    "the previous notebook (roughly 81% accuracy with `LogisticRegression`) to some\n",
    "simple baseline classifiers. The simplest baseline classifier is one that\n",
    "always predicts the same class, irrespective of the input data.\n",
    "\n",
    "- What would be the score of a model that always predicts `' >50K'`?\n",
    "- What would be the score of a model that always predicts `' <=50K'`?\n",
    "- Is 81% or 82% accuracy a good score for this problem?\n",
    "\n",
    "Use a `DummyClassifier` and do a train-test split to evaluate its accuracy on\n",
    "the test set. This\n",
    "[link](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)\n",
    "shows a few examples of how to evaluate the generalization performance of\n",
    "these baseline models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first split our dataset to have the target separated from the data used to\n",
    "train our predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by selecting only the numerical columns as seen in the previous\n",
    "notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
    "\n",
    "data_numeric = data[numerical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the data and target into a train and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# solution\n",
    "data_numeric_train, data_numeric_test, target_train, target_test = (\n",
    "    train_test_split(data_numeric, target, random_state=42)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a `DummyClassifier` such that the resulting classifier always predict the\n",
    "class `' >50K'`. What is the accuracy score on the test set? Repeat the\n",
    "experiment by always predicting the class `' <=50K'`.\n",
    "\n",
    "Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve\n",
    "the desired behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "# solution\n",
    "class_to_predict = \" >50K\"\n",
    "high_revenue_clf = DummyClassifier(\n",
    "    strategy=\"constant\", constant=class_to_predict\n",
    ")\n",
    "high_revenue_clf.fit(data_numeric_train, target_train)\n",
    "score = high_revenue_clf.score(data_numeric_test, target_test)\n",
    "print(f\"Accuracy of a model predicting only high revenue: {score:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We clearly see that the score is below 0.5 which might be surprising at first.\n",
    "We now check the generalization performance of a model which always predict\n",
    "the low revenue class, i.e. `\" <=50K\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "class_to_predict = \" <=50K\"\n",
    "low_revenue_clf = DummyClassifier(\n",
    "    strategy=\"constant\", constant=class_to_predict\n",
    ")\n",
    "low_revenue_clf.fit(data_numeric_train, target_train)\n",
    "score = low_revenue_clf.score(data_numeric_test, target_test)\n",
    "print(f\"Accuracy of a model predicting only low revenue: {score:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We observe that this model has an accuracy higher than 0.5. This is due to the\n",
    "fact that we have 3/4 of the target belonging to low-revenue class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Therefore, any predictive model giving results below this dummy classifier\n",
    "would not be helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "adult_census[\"class\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "(target == \" <=50K\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "In practice, we could have the strategy `\"most_frequent\"` to predict the class\n",
    "that appears the most in the training target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "most_freq_revenue_clf = DummyClassifier(strategy=\"most_frequent\")\n",
    "most_freq_revenue_clf.fit(data_numeric_train, target_train)\n",
    "score = most_freq_revenue_clf.score(data_numeric_test, target_test)\n",
    "print(f\"Accuracy of a model predicting the most frequent class: {score:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "So the `LogisticRegression` accuracy (roughly 81%) seems better than the\n",
    "`DummyClassifier` accuracy (roughly 76%). In a way it is a bit reassuring,\n",
    "using a machine learning model gives you a better performance than always\n",
    "predicting the majority class, i.e. the low income class `\" <=50K\"`."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}